July 9-12 2019
These slides: http://www.databrew.cc/malawi
-Introduction (Arsenio)
-Data management plan (Arsenio)
-Data entry (Arsenio)
-Data traceability (Arsenio)
-Practical session (Arsenio and Joe)
-Managing data quality and validation (Joe)
-Data coding (Joe)
-Data sharing (Joe)
-Practical session (Joe and Arsenio)
-Study registration (Joe)
-Information in the protocol (Joe)
-Statistics and statistical analysis plan (SAP) (Joe)
-Interim analysis (Joe)
-Publication (Joe)
-Data processing (Joe)
-Data cleaning (Joe)
Reason 1: It’s free
Reason 2: It’s “open source”
Reason 3: It’s beautiful
Reason 4: It’s powerful
Reason 5: It’s fun
-Introduction (Arsenio)
-Data management plan (Arsenio)
-Data entry (Arsenio)
-Data traceability (Arsenio)
-Practical session (Arsenio and Joe)
Let’s generate the following together:
-A research hypothesis
-A basic DMP
-A basic EDC (electronic data capture) entry system (Google Sheets)
-A scripted audit / log system (R)
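One way a scripted audit log could look in R is sketched here. The function name `log_change`, the column names, and the example records are all hypothetical; the idea is only that every correction to the data is recorded with who changed what, and when, rather than silently overwriting values.

```r
# Append one entry to an audit log (a data frame with one row per change).
# All names here are illustrative assumptions, not part of any package.
log_change <- function(log, record_id, field, old_value, new_value, user) {
  entry <- data.frame(
    timestamp = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    user      = user,
    record_id = record_id,
    field     = field,
    old_value = as.character(old_value),
    new_value = as.character(new_value),
    stringsAsFactors = FALSE
  )
  rbind(log, entry)
}

# Start with an empty log and record two corrections
audit_log <- NULL
audit_log <- log_change(audit_log, record_id = 7, field = "age",
                        old_value = 300, new_value = 30, user = "joe")
audit_log <- log_change(audit_log, record_id = 2, field = "likes_dancing",
                        old_value = "y", new_value = "yes", user = "arsenio")
audit_log
```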
This should be a falsifiable, generalized statement about the people in this room.
Example: the older people are, the more they like dancing.
This should be a list of bullet points (5-15), including
(If time permits, we’ll build this together later in the week)
Download R: https://www.r-project.org/
Download RStudio: https://www.rstudio.com/products/rstudio/download/
Let’s write some code!
2 + 2
[1] 4
x <- c(1,2,3,4,5)
x
[1] 1 2 3 4 5
barplot(x)
A “package” is simply a collection of code written by someone else.
It’s what makes R powerful, but also confusing.
You only have to install a package one time.
install.packages('tidyverse')
You have to use the library function every time you use a package.
library(tidyverse)
Writing library just means “I am going to use this package”.
Install the following packages:
tidyverse maptools RColorBrewer ggthemes knitr leaflet raster rgdal rgeos rmarkdown sp tidyr gsheet
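A small sketch of how to check which of these packages are still missing before installing anything. The variable names `needed` and `not_installed` are just illustrative. Note that some of the listed packages (maptools, rgdal, rgeos) may no longer be available from CRAN, so installation of those may fail.

```r
# The packages listed above
needed <- c('tidyverse', 'maptools', 'RColorBrewer', 'ggthemes', 'knitr',
            'leaflet', 'raster', 'rgdal', 'rgeos', 'rmarkdown', 'sp',
            'tidyr', 'gsheet')

# Which of them are not yet in the local library?
not_installed <- setdiff(needed, rownames(installed.packages()))
not_installed

# Install only what is missing (uncomment to run; needs an internet connection)
# install.packages(not_installed)
```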
a <- 1
a + 3
Let’s create an object called “ages”, with the age of everyone
ages <- c(30, 26, 31, 39, 45, 27, 28, 22, 19, 30, 35)
How do we view our ages object?
ages
[1] 30 26 31 39 45 27 28 22 19 30 35
How do we view just the first element of our ages object?
ages[1]
[1] 30
How do we sort our ages object?
sorted_ages <- sort(ages)
sorted_ages
[1] 19 22 26 27 28 30 30 31 35 39 45
How do we get the minimum, maximum, average age?
min(ages)
[1] 19
max(ages)
[1] 45
mean(ages)
[1] 30.18182
How do we visualize our ages object?
hist(ages)
Previously, we looked at a one dimensional object: ages.
But most data is two dimensional: rows and columns.
This is called a data frame.
Let’s play around with some real data.
Let’s create a simple dataframe
www.databrew.cc/frangos.csv
frangos <- databrew::frangos
head(frangos)
# A tibble: 6 x 4
  diet  chick  days grams
  <chr> <int> <dbl> <int>
1 corn      1 0.192    42
2 corn      1 1.01     51
3 corn      1 4.52     59
4 corn      1 6.72     64
5 corn      1 8.14     76
6 corn      1 9.11     93
Let’s explore.
Brackets: []
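A short sketch of how brackets work on a data frame: `[row, column]`. So that it runs without the databrew package, it rebuilds only the six rows printed above into a stand-in object called `frangos_head` (a hypothetical name); with the full data, the same brackets apply to `frangos`.

```r
# Stand-in for the first six rows of frangos shown above
frangos_head <- data.frame(
  diet  = rep("corn", 6),
  chick = rep(1L, 6),
  days  = c(0.192, 1.01, 4.52, 6.72, 8.14, 9.11),
  grams = c(42L, 51L, 59L, 64L, 76L, 93L)
)

frangos_head[1, ]        # first row, all columns
frangos_head[, "grams"]  # all rows, one column (returned as a vector)
frangos_head[2, 4]       # second row, fourth column

# A logical condition inside the brackets keeps only the matching rows
heavy <- frangos_head[frangos_head$grams > 60, ]
heavy
```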
-Data coding
-Data sharing
-Practical session
-“Coding” is the act of assigning a (usually numeric) value to a categorical concept.
-Example: Female = 1, Male = 2, Other = 3, Unknown = 4
-Example: Aged 0-5 = 1, Aged 6-18 = 2, Aged 19-45 = 3, Aged 46+ = 4, Unknown = 98
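In R, coded values like the first example can be turned back into readable labels with `factor()`. This is a minimal sketch; the vector `sex_code` contains made-up values.

```r
# Made-up coded responses, using the coding scheme from the example above
sex_code <- c(1, 2, 2, 4, 3, 1)

# Attach the human-readable label to each code
sex <- factor(sex_code,
              levels = c(1, 2, 3, 4),
              labels = c("Female", "Male", "Other", "Unknown"))

sex
table(sex)
```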
-Saves physical space on paper CRFs
-Saves significant time in paper-to-digital data entry
-Forces categorization (not necessarily good)
-Forces a priori thinking about meaningful categorization
-Saves hard-drive space
-Lots of data capture is now digital
-Categorization is not necessarily good
-Hard-drive space is rarely a limiting issue
-Coding means one more layer between the data and understanding it
-You should have comprehensive data dictionaries: both machine- and human-readable
-Your “levels” should make ordinal/notional sense
-Your categories/codes should be tested prior to deployment
-Automated joins vs. manual recoding
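A sketch of what an automated join against a machine-readable data dictionary could look like, using the age coding from the earlier example. The object names (`dictionary`, `records`) and the record values are made up; one record deliberately carries a code (7) that is not in the dictionary, to show how such problems can be caught before deployment.

```r
# Machine-readable data dictionary: one row per valid code
dictionary <- data.frame(
  age_code  = c(1, 2, 3, 4, 98),
  age_group = c("0-5", "6-18", "19-45", "46+", "Unknown"),
  stringsAsFactors = FALSE
)

# Made-up coded records; the code 7 does not exist in the dictionary
records <- data.frame(id = 1:5, age_code = c(3, 1, 98, 4, 7))

# Test the codes: which ones are not defined in the dictionary?
unknown_codes <- setdiff(records$age_code, dictionary$age_code)
unknown_codes

# Automated join: every record gets its label; undefined codes become NA
decoded <- merge(records, dictionary, by = "age_code", all.x = TRUE)
decoded <- decoded[order(decoded$id), ]
decoded
```

Compared with recoding by hand (one `ifelse` per code), the join keeps the dictionary as the single place where codes are defined.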
-Identifying vs non-identifying information
-Health vs non-health data
-Raw vs processed
-Individual vs aggregated
I.e., turning individual-level data into group-level data
I.e., making individual-level identifiable data non-identifiable
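Both ideas can be sketched in a few lines of R. The data frame `individual` and all its values are made up. Note that dropping a name column is only the simplest step: combinations of other variables can still identify a person, so this is an illustration, not a complete de-identification procedure.

```r
# Made-up individual-level data with an identifying column
individual <- data.frame(
  name     = c("A", "B", "C", "D"),
  district = c("North", "North", "South", "South"),
  age      = c(30, 26, 31, 39),
  stringsAsFactors = FALSE
)

# Aggregation: individual-level data becomes group-level data
by_district <- aggregate(age ~ district, data = individual, FUN = mean)
by_district

# De-identification: drop the direct identifier, keep an arbitrary study ID
deidentified <- individual[, c("district", "age")]
deidentified$study_id <- seq_len(nrow(deidentified))
deidentified
```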
www.databrew.cc/frangos.csv
Need to fill out
We’re going to use the cism package to get weather data for the FQMA weather station (Maputo).
library(cism)
??get_weather
weather <- get_weather(station = 'FQMA',
start_year = 2010,
end_year = 2016)
Now that we have our weather data, we can look at it.
head(weather)
# 1. How many rows are in our data?
nrow(weather)
# 2. How many columns?
ncol(weather)
# 3. What are the names of the columns?
colnames(weather)
# 4. What is the date range?
range(weather$date)
# 5. What is the maximum temperature?
max(weather$temp_max, na.rm = TRUE)
# 6. What is the minimum temperature?
min(weather$temp_min, na.rm = TRUE)
# 7. What is the average temperature?
mean(weather$temp_mean, na.rm = TRUE)
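Why `na.rm = TRUE`? A small sketch with a stand-in data frame, for trying the commands without the cism package. The column names match those used above, but `weather_demo` and all its values are made up, not real Maputo data.

```r
# Made-up stand-in with the same column names as the weather data
weather_demo <- data.frame(
  date      = as.Date("2010-01-01") + 0:4,
  temp_max  = c(31, 29, NA, 33, 28),
  temp_min  = c(22, 21, 20, NA, 19),
  temp_mean = c(26, 25, 24, 27, NA)
)

# With a missing value in the column, the result is itself missing
max(weather_demo$temp_max)

# na.rm = TRUE removes missing values before calculating
max(weather_demo$temp_max, na.rm = TRUE)
min(weather_demo$temp_min, na.rm = TRUE)
mean(weather_demo$temp_mean, na.rm = TRUE)

# Dates work with range() too
range(weather_demo$date)
```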
Which of our variables are numeric and continuous?
temp_max, temp_mean, temp_min, etc.
How can we visualize these?
boxplot(weather$temp_mean)
hist(weather$temp_mean)
Let’s create a variable called “hot”
weather$hot <- ifelse(weather$temp_max > 30, 'hot', 'not hot')
head(weather)
table(weather$hot)
hot_table <- table(weather$hot)
hot_prop_table <- prop.table(hot_table)
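The same steps on a tiny made-up vector, to show what each object contains. The values in `temp_max` are invented, and one is deliberately missing: `ifelse()` passes the `NA` through, and `table()` leaves it out unless asked otherwise.

```r
# Made-up maximum temperatures, one of them missing
temp_max <- c(31, 29, 35, 30, NA, 33)

# "hot" only if strictly above 30, so a day at exactly 30 is "not hot"
hot <- ifelse(temp_max > 30, 'hot', 'not hot')
hot

# Counts per category (NA is dropped by default)
hot_table <- table(hot)
hot_table

# Show the missing value as its own category
table(hot, useNA = 'ifany')

# Proportions instead of counts
hot_prop_table <- prop.table(hot_table)
hot_prop_table
```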
barplot(hot_table)
barplot(hot_table,
        main = 'Hot days in Maputo',  # title
        ylab = 'Number of days',      # y-axis label
        xlab = 'Temperature',         # x-axis label
        col = c('red', 'blue'),       # one fill color per bar
        border = 'darkgrey')          # color of the bar outlines
Let’s create a plot of date (x-axis) and the maximum temperature (y-axis)
plot(weather$date,
     weather$temp_max)
Let’s make our plot prettier
plot(weather$date,
     weather$temp_max,
     type = 'l',
     col = 'red',
     xlab = 'Date',
     ylab = 'Maximum temperature',
     main = 'Maximum temperature in Maputo')
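To try the plotting arguments without the cism package, here is a sketch using made-up monthly values (the vectors `dates` and `temp_max` are invented, not real Maputo data). The dashed line marks the 30-degree threshold used earlier to define “hot”.

```r
# Made-up data: one value per month for a single year
dates <- seq(as.Date("2010-01-01"), by = "month", length.out = 12)
temp_max <- c(31, 31, 30, 28, 27, 25, 24, 26, 27, 28, 29, 30)

# type = 'l' draws a line instead of points
plot(dates,
     temp_max,
     type = 'l',
     col = 'red',
     xlab = 'Date',
     ylab = 'Maximum temperature',
     main = 'Maximum temperature (made-up data)')

# Add a dashed horizontal line at the "hot" threshold
abline(h = 30, lty = 2)
```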